
    A framework for multidimensional indexes on distributed and highly-available data stores

    Non-relational databases are nowadays a common solution when dealing with huge data sets and massive query workloads. These systems have been redesigned from scratch in order to achieve scalability and availability, at the cost of providing only a reduced set of low-level functionality, thus forcing the client application to implement complex logic. As a solution, our research group developed Hecuba, a set of tools and interfaces that aims to provide developers with an efficient and painless interaction with non-relational technologies. This paper presents the part of Hecuba related to a particular missing feature: multidimensional indexing. Our work focuses on the design of architectures and algorithms for providing multidimensional indexing on a distributed database without compromising scalability and availability.
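
    The abstract does not spell out the algorithms. A common way to layer a multidimensional index over a distributed key-value store, sketched below under that assumption, is a space-filling curve such as the Z-order (Morton) curve, which maps nearby points to nearby one-dimensional keys. This is an illustrative sketch, not Hecuba's actual design.

```python
# Illustrative sketch (not Hecuba's actual algorithm): a Z-order (Morton)
# encoding maps multidimensional points to one-dimensional keys, so a plain
# distributed key-value store can serve range queries over bounding boxes.

def interleave_bits(coords, bits=16):
    """Interleave the bits of each coordinate into a single Morton code."""
    code = 0
    for bit in range(bits):
        for dim, value in enumerate(coords):
            code |= ((value >> bit) & 1) << (bit * len(coords) + dim)
    return code

def morton_key(point, bits=16):
    """Build a sortable byte key; stores that order by key then keep
    spatially close points in nearby partitions."""
    size = (bits * len(point) + 7) // 8
    return interleave_bits(point, bits).to_bytes(size, "big")

# Points that are close in 2-D space get lexicographically close keys:
print(morton_key((3, 5)).hex())  # 00000027
print(morton_key((3, 6)).hex())  # 0000002d
```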

    Improving the performance of Java applications through cooperation between the operating system and the Java virtual machine

    The use of virtualized execution environments has spread to every domain and, in particular, they are being used to develop and run applications with a high resource consumption. It therefore becomes necessary to evaluate whether these platforms offer adequate performance for this kind of program, and whether it is possible to exploit their characteristics to favour program execution. The main goal of this work has been to demonstrate that it is possible to exploit the characteristics of virtualized execution environments to offer programs a resource management better suited to their behaviour. In this work we show that the execution model of these environments, based on running on top of virtual machines, offers a new opportunity to implement program-specific resource management, improving program performance without giving up the numerous advantages of such platforms, for example, full code portability. To demonstrate the benefits of this strategy, we selected as a case study the management of memory for scientific computing programs in the Java runtime environment. After a detailed analysis of the influence that memory management has on this kind of program, we found that adding to the runtime environment a page prefetching policy that adapts to program behaviour is a feasible way to improve performance. For this reason, we analysed in detail the requirements this policy must fulfil and how to distribute the tasks among the different components of the Java runtime environment to meet them. As a result, we designed a prefetching policy based on cooperation between the virtual machine and the operating system. In our proposal, on the one hand, prefetching decisions are made using all the knowledge the virtual machine has about the dynamic behaviour of the programs together with the knowledge the operating system has about the execution conditions. On the other hand, the operating system is in charge of carrying out the management decisions, which guarantees the reliability of the machine. Moreover, this strategy is fully transparent to the programmer and the user, respecting the portability paradigm of virtualized execution environments. We implemented and evaluated this strategy to demonstrate the benefits it offers to the selected kind of program and, although these benefits depend on the characteristics of each program, the performance improvement reached up to 40% when compared with the performance obtained on the original execution environment.
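
    As a rough, user-level illustration of the division of labour the thesis proposes (the runtime predicts, the operating system prefetches), the following sketch detects a constant stride between page accesses and hands the prediction to the kernel via madvise. It assumes Linux and Python 3.8 or later; the stride detector itself is invented for illustration and is not the thesis' policy.

```python
# Rough user-level illustration of the idea (the thesis implements this
# inside the JVM and the OS): the runtime, which observes the access
# pattern, predicts future pages; the kernel performs the actual prefetch.
import mmap

PAGE = mmap.PAGESIZE

class StridePrefetcher:
    """Detects a repeated stride between page accesses and asks the OS
    to read the next few predicted pages ahead of time."""
    def __init__(self, buf, depth=4):
        self.buf, self.depth = buf, depth
        self.last, self.stride = None, None

    def touch(self, offset):
        page = offset // PAGE
        if self.last is not None:
            stride = page - self.last
            if stride != 0 and stride == self.stride:
                start = (page + stride) * PAGE
                if 0 <= start < len(self.buf):
                    length = min(self.depth * abs(stride) * PAGE,
                                 len(self.buf) - start)
                    # MADV_WILLNEED hints the kernel to page the range in.
                    self.buf.madvise(mmap.MADV_WILLNEED, start, length)
            self.stride = stride
        self.last = page
```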

    Hecuba: NoSQL made easy

    Non-relational databases are nowadays a common solution when dealing with huge data sets and massive query workloads. These systems have been redesigned from scratch in order to achieve scalability and availability, at the cost of providing only a reduced set of low-level functionality, thus forcing the client application to take care of complex logic. As a solution, our research group developed Hecuba, a set of tools and interfaces that aims to provide programmers with an efficient and easy interaction with non-relational technologies.
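
    To make "easy interaction" concrete, here is a hypothetical dict-like wrapper over the DataStax Cassandra driver. The KeyValueTable class and its schema are invented for illustration; they are not Hecuba's real API.

```python
# Hypothetical sketch of the kind of abstraction such a tool provides
# (this class is invented for illustration; it is not Hecuba's API).
# Requires the DataStax driver: pip install cassandra-driver
from cassandra.cluster import Cluster

class KeyValueTable:
    """Expose a Cassandra table through a Python dict-like interface,
    hiding the CQL plumbing from the application code."""
    def __init__(self, keyspace, table):
        self.session = Cluster().connect(keyspace)
        self.table = table

    def __setitem__(self, key, value):
        self.session.execute(
            f"INSERT INTO {self.table} (k, v) VALUES (%s, %s)", (key, value))

    def __getitem__(self, key):
        row = self.session.execute(
            f"SELECT v FROM {self.table} WHERE k = %s", (key,)).one()
        if row is None:
            raise KeyError(key)
        return row.v

# Client code then reads like plain Python instead of database calls:
# temperatures = KeyValueTable("experiments", "temperatures")
# temperatures["sensor-42"] = 21.7
```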

    Introducing polyglot-based data-flow awareness to time-series data stores

    The rising interest in extracting value from data has led to a broad proliferation of monitoring infrastructures, most notably composed of sensors, intended to collect this new oil. Gathering data has thus become fundamental for a great number of applications, such as predictive maintenance techniques or anomaly detection algorithms. However, before data can be refined into insights and knowledge, it has to be efficiently stored and prepared for later retrieval. As a consequence of this sensor and IoT boom, Time-Series databases (TSDBs), designed to manage sensor data, became the fastest-growing database category since 2019. Here we propose a holistic approach intended to improve TSDB performance and efficiency. More precisely, we introduce and evaluate a novel polyglot-based approach, aimed at tailoring the data store not only to time-series data, as is done conventionally, but also to the data flow itself: from its ingestion until its retrieval. In order to evaluate the approach, we materialize it in an alternative implementation of NagareDB, a resource-efficient time-series database built on MongoDB, in turn the most popular NoSQL storage solution. After implementing our approach in the database, we observe a global speed-up, solving queries up to 12 times faster than MongoDB's recently launched time-series capability, as well as generally outperforming InfluxDB, the most popular time-series database. Our polyglot-based data-flow-aware solution can ingest data more than two times faster than MongoDB, InfluxDB, and NagareDB's original implementation, while using the same disk space as InfluxDB and half of that requested by MongoDB. This research was partly supported by the Spanish Ministry of Science and Innovation (contract PID2019-107255GB) and by the Generalitat de Catalunya (contract 2017-SGR-1414).
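
    The abstract does not detail the storage layout. One classic building block for time-series workloads on MongoDB, assumed in the sketch below, is bucketing readings into one document per sensor and hour, so ingestion is a single upsert and range queries touch few, large documents. Collection and field names are invented.

```python
# Minimal sketch of the bucketing pattern commonly layered on MongoDB
# for time series (database, collection, and field names invented here).
# Requires: pip install pymongo
from datetime import datetime, timezone
from pymongo import MongoClient

buckets = MongoClient()["monitoring"]["sensor_buckets"]

def ingest(sensor_id, value, ts=None):
    """Append a reading to the current hourly bucket of its sensor."""
    ts = ts or datetime.now(timezone.utc)
    hour = ts.replace(minute=0, second=0, microsecond=0)
    buckets.update_one(
        {"sensor": sensor_id, "hour": hour},          # one doc per sensor-hour
        {"$push": {"samples": {"t": ts, "v": value}}},  # cheap append
        upsert=True)

ingest("sensor-42", 21.7)
```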

    A holistic scalability strategy for time series databases following cascading polyglot persistence

    Time series databases aim to handle large amounts of data quickly, both when introducing new data into the system and when retrieving it later on. However, depending on the scenario in which these databases participate, reducing the number of requested resources becomes a further requirement. Following this goal, NagareDB and its Cascading Polyglot Persistence approach were born. They were not just intended to provide a fast time-series solution, but also to strike a good cost-efficiency balance. However, although they provided outstanding results, they lacked a natural way of scaling out in a cluster fashion. Consequently, monolithic deployments could extract the maximum value from the solution, but distributed ones had to rely on general scalability approaches. In this research, we propose a holistic approach specially tailored for databases following Cascading Polyglot Persistence, to further maximize its inherent resource-saving goals. The proposed approach reduced the cluster size by 33% in a setup with just three ingestion nodes, and by up to 50% in a setup with 10 ingestion nodes. Moreover, the evaluation shows that our scaling method provides efficient cluster growth, offering scalability speedups greater than 85% of theoretically perfect scaling, while also ensuring data safety via data replication. This research was partly supported by Grant Agreement No. 857191, by the Spanish Ministry of Science and Innovation (contract PID2019-107255GB) and by the Generalitat de Catalunya (contract 2017-SGR-1414).
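
    As a worked note on the "greater than 85%" figure, a standard way to measure cluster growth against perfect scaling (assumed here; not necessarily the paper's exact metric) is parallel efficiency, sketched below.

```python
# Parallel efficiency: measured throughput divided by N times the
# single-node throughput; 100% would be theoretically perfect scaling.

def scaling_efficiency(throughput_1, throughput_n, n):
    """Fraction of ideal linear speedup actually achieved with n nodes."""
    return throughput_n / (n * throughput_1)

# e.g. a 10-node cluster ingesting 8.7x the single-node rate:
print(f"{scaling_efficiency(100_000, 870_000, 10):.0%}")  # -> 87%
```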

    A compromise archive platform for monitoring infrastructures

    The great advancement in the technological field has led to an explosion in the amount of generated data. Many different sectors have understood the opportunity that acquiring, storing, and analyzing further information represents, which has led to a broad proliferation of measurement devices. Those sensors' typical job is to monitor the state of the enterprise ecosystem, which can range from a traditional factory, to a commercial mall, or even to the largest experiment on Earth [1]. Big enterprises (BEs) are building their own big data architectures, usually made out of a combination of several state-of-the-art technologies. Finding new interesting data to measure, store, and analyze has become a daily process in the industrial field. However, small and medium-sized enterprises (SMEs) usually lack the resources needed to build those data-handling architectures, not just in terms of hardware, but also in terms of hiring personnel who can master all those rapidly evolving technologies. Our research adapts two widely used technologies into a single, elastic, and moldable one by tuning them, in order to offer an alternative and efficient solution for this very specific, but common, scenario.

    Evaluating the benefits of key-value databases for scientific applications

    The convergence of Big Data applications with High-Performance Computing requires new methodologies to store, manage, and process large amounts of information. Traditional storage solutions are unable to scale, which results in complex coding strategies. For example, the brain atlas of the Human Brain Project faces the challenge of processing large amounts of high-resolution brain images. Given the computing needs, we study the effects of replacing a traditional storage system with a distributed key-value database on a cell segmentation application. The original code uses HDF5 files on GPFS through an intricate interface, imposing synchronizations. On the other hand, by using Apache Cassandra or ScyllaDB through Hecuba, the application code is greatly simplified. Thanks to the key-value data model, the number of synchronizations is reduced and the time dedicated to I/O scales when increasing the number of nodes. This research has received funding from the European Union's Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreement No. 720270 (Human Brain Project SGA1) and the Specific Grant Agreement No. 785907 (Human Brain Project SGA2). This work has also been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contract 2017-SGR-1414).
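
    The data-model shift can be pictured as follows. The schema is invented for illustration (the paper uses Hecuba on top of Cassandra or ScyllaDB, while this sketch calls the driver directly): each worker reads and writes independent per-block rows instead of synchronising around one shared HDF5 file.

```python
# Sketch of storing image blocks as key-value rows (schema invented).
# Requires: pip install cassandra-driver numpy
import numpy as np
from cassandra.cluster import Cluster

session = Cluster().connect("brain_atlas")  # hypothetical keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS blocks (
        image_id text, bx int, by int, data blob,
        PRIMARY KEY ((image_id), bx, by))""")

def put_block(image_id, bx, by, block: np.ndarray):
    """Each worker writes its own block; no file lock, no barrier."""
    session.execute(
        "INSERT INTO blocks (image_id, bx, by, data) VALUES (%s, %s, %s, %s)",
        (image_id, bx, by, block.tobytes()))

def get_block(image_id, bx, by, shape, dtype=np.uint16):
    """Fetch one block and rebuild the array view of it."""
    row = session.execute(
        "SELECT data FROM blocks WHERE image_id=%s AND bx=%s AND by=%s",
        (image_id, bx, by)).one()
    return np.frombuffer(row.data, dtype=dtype).reshape(shape)
```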

    Aeneas: A tool to enable applications to effectively use non-relational databases

    Non-relational databases arise as a solution to the scalability problems of relational databases when dealing with big data applications. However, they are highly configurable and prone to user decisions that can heavily affect their performance. In order to maximize performance, different data models and queries should be analyzed to choose the best fit. This may involve a wide range of tests and may result in productivity issues. We present Aeneas, a tool to support the design of data management code for applications using non-relational databases. Aeneas provides an easy and fast methodology to support decisions about how to organize and retrieve data in order to improve performance.
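
    As a hypothetical illustration of the kind of decision Aeneas supports (the heuristic below is invented and is not the tool's actual method), one can score candidate partition keys by how many of the expected queries they answer without a full scan.

```python
# Toy scoring of candidate data models against an expected query set.

def score_models(candidate_keys, queries):
    """candidate_keys: field tuples; queries: sets of fields each query
    filters on. A model serves a query efficiently when the query
    constrains every field of the partition key."""
    return sorted(
        (sum(set(key) <= q for q in queries), key)
        for key in candidate_keys)[::-1]

queries = [{"user"}, {"user", "day"}, {"day"}]
print(score_models([("user",), ("day",), ("user", "day")], queries))
# ('user',) and ('day',) each serve 2 of 3 queries; ('user', 'day') only 1.
```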

    Exploiting key-value data stores scalability for HPC

    Big Data revolutionised the IT industry. It first affected OLTP systems: Distributed Hash Tables (DHTs) replaced traditional SQL databases, as they guaranteed low response times on simple read/write requests. The second wave recast data warehousing: map-reduce systems spread as they proved to scale long-running computational workloads linearly on commodity servers. The focus now is on real-time analytics, since being able to analyse massive quantities of data in a short time enables multiple HPC applications as well as interactive analysis and visualization. In this paper, we study the performance of a system that employs the DHT architecture to achieve fast local analysis on indexed data. We observed that the number of keys, the number of nodes, and the hardware characteristics strongly influence the actual scalability of the system. Therefore, we developed a mathematical model that allows finding the right system configuration to meet the desired performance for each kind of query. We also show how our model can be used to find the right architecture for each distributed application. This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 720270 (HBP SGA1). It is also partially supported by grant SEV-2011-00067 of the Severo Ochoa Program awarded by the Spanish Government, by the TIN2015-65316-P project, with funding from the Spanish Ministry of Economy and Competitiveness and the European Union FEDER funds, and by grant 2014-SGR-1051.
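
    The paper's model itself is not reproduced in the abstract. As a hedged, generic stand-in, a first-order scatter-gather model captures why keys, nodes, and hardware all matter: each query pays a fixed coordination cost plus per-node scan work over roughly K/N keys. All constants below are made up.

```python
# Generic first-order latency model for a scatter-gather query over a
# DHT: fixed coordination overhead + per-node share of the key scan.
# (Illustrative form and toy constants; not the paper's actual model.)

def latency(n_nodes, n_keys, t_key=2e-8, coord=3e-3):
    """Predicted response time in seconds."""
    return coord + (n_keys / n_nodes) * t_key

def nodes_needed(n_keys, target, max_nodes=256, **kw):
    """Smallest cluster size whose predicted latency meets the target;
    None when even max_nodes cannot reach it (fixed cost dominates)."""
    for n in range(1, max_nodes + 1):
        if latency(n, n_keys, **kw) <= target:
            return n
    return None

# 100 M keys with a 50 ms target:
print(nodes_needed(100_000_000, 0.050))  # -> 43 with these toy constants
```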

    Automatic query driven data modelling in Cassandra

    Non-relational databases have recently become the preferred choice when it comes to dealing with Big Data challenges, but their performance is very sensitive to the chosen data organisation. We have seen differences of over 70 times in response time for the same query on different models. This forces users to be fully aware of the queries they intend to serve when designing their data model. The common practice, then, is to replicate data into different models designed to fit different query requirements, leaving the user in charge of the code needed to keep the different data replicas consistent. Manually replicating data in such high layers of the database squanders a lot of storage, on top of the underlying system replication mechanisms, which are designed for availability and reliability purposes. We propose and design a mechanism, and a prototype, to provide users with transparent management, where each query is matched with a well-performing model option. Additionally, we propose to do so by turning the replication mechanism into a heterogeneous one, in order to avoid squandering disk space while keeping the availability and reliability features. The result is a system where, regardless of the query or model the user specifies, the response time will always be that of an affine query.
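
    The mechanism can be pictured with a toy in-memory sketch (the routing logic below is invented for illustration): each replica holds the same rows under a different partitioning, and every query is answered from the replica whose partition key its predicate covers.

```python
# Toy model of heterogeneous replication with query routing.

class HeterogeneousReplicas:
    def __init__(self):
        self.replicas = {}   # partition-key fields -> {key tuple: row}

    def insert(self, row, layouts):
        """Write once per layout: the copies that a homogeneous system
        would keep identical are organised differently instead."""
        for fields in layouts:
            fields = tuple(sorted(fields))
            store = self.replicas.setdefault(fields, {})
            store[tuple(row[f] for f in fields)] = row

    def query(self, predicate):
        """Route to the replica whose partition key the predicate covers."""
        fields = tuple(sorted(predicate))
        store = self.replicas.get(fields)
        if store is None:
            raise LookupError("no well-performing model for this query")
        return store.get(tuple(predicate[f] for f in fields))

db = HeterogeneousReplicas()
db.insert({"user": "ana", "day": "2024-01-01", "clicks": 7},
          layouts=[("user",), ("day",)])
print(db.query({"day": "2024-01-01"}))  # served by the day-partitioned copy
```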